Text Classification using String Kernels
Abstract
We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text, though not necessarily contiguously. Each occurrence is weighted by an exponentially decaying factor of its full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how, despite this fact, the inner product can be efficiently evaluated by a dynamic programming technique. Experimental comparisons with a standard word feature space kernel (Joachims, 1998) show positive results on modestly sized datasets. The case of contiguous subsequences is also considered, for comparison with the subsequences kernel with different decay factors. For larger documents and datasets, the paper introduces an approximation technique that is shown to deliver good approximations efficiently.
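The abstract does not spell out the dynamic programming recursion, so the following is only a minimal sketch of one commonly presented formulation of a gap-weighted subsequence kernel. The function names `ssk` and `ssk_normalized`, the decay parameter name `lam`, and the normalisation step are assumptions made for illustration, not details taken from the abstract.

```python
import math


def ssk(s, t, k, lam):
    """Gap-weighted subsequence kernel of order k (illustrative sketch).

    Each common subsequence of length k contributes lam raised to the
    total span it covers in s plus the span it covers in t.
    Runs in O(k * len(s) * len(t)) time.
    """
    m, n = len(s), len(t)
    if k < 1 or min(m, n) < k:
        return 0.0
    # kp[i][a][b] is the auxiliary kernel of order i on s[:a] and t[:b].
    kp = [[[0.0] * (n + 1) for _ in range(m + 1)] for _ in range(k)]
    for a in range(m + 1):
        for b in range(n + 1):
            kp[0][a][b] = 1.0  # order 0: the empty subsequence
    for i in range(1, k):
        for a in range(i, m + 1):
            kpp = 0.0  # running sum over matching positions in t
            for b in range(i, n + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    total = 0.0
    for a in range(k, m + 1):
        for b in range(k, n + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[k - 1][a - 1][b - 1]
    return total


def ssk_normalized(s, t, k, lam):
    """Kernel value divided by sqrt(K(s,s) * K(t,t)); 0.0 if either is zero."""
    denom = math.sqrt(ssk(s, s, k, lam) * ssk(t, t, k, lam))
    return ssk(s, t, k, lam) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    print(ssk("cat", "car", 2, 0.5))
    print(ssk_normalized("cat", "car", 2, 0.5))
```

With k = 2, the strings "cat" and "car" share only the subsequence "ca", which is contiguous in both, so the unnormalised value is lam**4. A smaller `lam` penalises gaps more heavily, which moves the kernel toward the contiguous case mentioned in the abstract.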
Similar resources
Some results about the use of tree/string edit distances in a nearest neighbour classification task
In pattern recognition there is a variety of applications where the patterns are classified using edit distance. In this paper we present some results comparing the use of tree and string edit distances in a handwritten character recognition task. Some experiments with different numbers of classes and of classifiers are done.
Jumping Emerging Substrings in Image Classification
We propose a new image classification scheme based on the idea of mining jumping emerging substrings between classes of images represented by visual features. Jumping emerging substrings (JES) are string patterns which occur frequently in one set of string data and are absent in another. By representing images in a symbolic manner, according to their color and texture characteristics, we enable m...
Structural Representation of Speech for Phonetic Classification
This paper explores the issues involved in using symbolic metric algorithms for automatic speech recognition (ASR), via a structural representation of speech. This representation is based on a set of phonological distinctive features which is a linguistically well-motivated alternative to the “beads-on-a-string” view of speech that is standard in current ASR systems. We report the promising res...
Edit Distance for Ordered Vector Sets: A Case of Study
Digital contours in a binary image can be described as an ordered vector set. In this paper an extension of the string edit distance is defined for its computation between a pair of ordered sets of vectors. This way, the differences between shapes can be computed in terms of editing costs. In order to achieve efficiency a dominant point detection algorithm should be applied, removing redundant data ...
Publication date: 2002